An Analysis of the Enron Email Dataset

Introduction to the Dataset

https://www.kaggle.com/wcukierski/enron-email-dataset

The dataset was obtained by the Federal Energy Regulatory Commission (FERC) during its investigation into the collapse of Enron. It encompasses approximately 500,000 emails from over 150 employees of the Enron Corporation.

After a decade of prosperous growth during the 1990s, Enron came under pressure from both shareholders and the company's executives to return profits to their once-high levels.

In an attempt to hide expenses and make profits seem higher than they actually were, executives abused an accounting technique called "mark-to-market." In addition, the company began moving its riskier, more troubled divisions into "Special Purpose Entities" (SPEs), which made Enron's losses look far smaller than they were in reality. Both practices, as used at Enron, were fraudulent.

By mid-2001, a number of analysts had begun to dig into Enron's financial reports and uncover these massive discrepancies. An internal investigation was launched first, followed by an investigation by the Securities and Exchange Commission (SEC), all while Enron's share price continued to fall from a mid-2000 high of $90 to less than $1 by the end of 2001.

On December 2nd, 2001, Enron officially filed for Chapter 11 bankruptcy, and many of its executives were indicted. It was during the ensuing investigation that the emails composing this dataset came to light.

Notes on the Dataset

Due to its peculiar nature, as will be shown below, I had neither traditional research questions nor typical numeric outputs for most of my code. Most of the challenge in working with this dataset came in transforming the monstrous 'message' column into a usable dataframe.

I chose this dataset on the recommendation of Dr. Bixler. I had originally intended to work with a dataset that was similar in structure but focused on spam in text messages. That one, however, was quite short, and he directed my attention to this dataset instead.

Although the use of this dataset raises a considerable number of ethical questions, I will address them only briefly. The dataset was originally retrieved as part of an investigation by the SEC and other government bodies, and this information only came to light in response to the company's wrongdoing, so I do not feel uneasy about going through these people's personal information.

Initialization of Dataset

This is a relatively simple dataset for a project like this, as it only has two columns. The difficulty, however, comes in cleaning up the message column.

As shown above, this column is composed of thousands of long, messy strings. These strings contain nearly all the data about each email, including the ID, date, sender, receiver, and subject, in a semi-organized manner.

Our challenge here is separating all of these categories into multiple columns, so that we are actually able to do some work with this dataset.
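As a rough illustration of that splitting step, Python's standard email module can break one of these strings into its header fields and body. The raw string below is a fabricated example in the dataset's general shape, and parse_email is a hypothetical helper, not the notebook's actual code:

```python
import email
import pandas as pd

def parse_email(raw: str) -> dict:
    """Split one raw message string into its header fields plus the body."""
    msg = email.message_from_string(raw)
    return {
        'Message-ID': msg.get('Message-ID'),
        'Date': msg.get('Date'),
        'From': msg.get('From'),
        'To': msg.get('To'),
        'Subject': msg.get('Subject'),
        'X-Folder': msg.get('X-Folder'),
        'body': msg.get_payload(),
    }

# A made-up message in the dataset's general format
raw = (
    "Message-ID: <123.JavaMail.evans@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: tim.belden@enron.com\n"
    "Subject: Trading update\n"
    "X-Folder: \\Phillip_Allen_Jan2002\\Sent Mail\n"
    "\n"
    "Here is our forecast.\n"
)

# Applying parse_email over the whole 'message' column yields the tidy frame
df = pd.DataFrame([parse_email(raw)])
```

Mapping this over the full column turns each category laid out in the message string into its own dataframe column.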

Tidying

It's not perfect: some optimization could still be done on the date field, or even on some of the X-To elements, but most of the data has now been neatly split into a column for each of the categories laid out in the original message string.

Exploratory Analysis

Folder Analysis

Above I created a visualization of the various folders into which these emails are sorted. At the very top we can see a folder called Kay Mann June 2001 with over 6,000 emails in it.
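The folder tallies behind a chart like this can be sketched with plain pandas, assuming the X-Folder header has already been split out into its own column; the frame below uses made-up values:

```python
import pandas as pd

# Hypothetical parsed frame with an X-Folder column, as produced by the tidying step
emails = pd.DataFrame({
    'X-Folder': [
        '\\Kay_Mann_June2001\\Notes Folders\\All documents',
        '\\Kay_Mann_June2001\\Notes Folders\\Sent',
        '\\Phillip_Allen_Jan2002\\Sent Mail',
    ]
})

# Keep only the top-level folder name, then count emails per folder
top_folder = emails['X-Folder'].str.split('\\').str[1]
folder_counts = top_folder.value_counts()

# folder_counts.head(20).plot.barh() would reproduce a chart of this kind
```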

User Analysis

The most emails, as shown in the graph above, were sent by Kaminski V, with Dasovich J only a short distance behind. I find the drop-off in email totals quite intriguing, as just four people make up over a fifth of this dataset. Considering how monstrous this dataset is in size, I would say that is quite surprising.
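A sketch of how per-user counts like these can be derived: the dataset's file column encodes the mailbox owner as the first path component. The rows below are fabricated stand-ins, not real tallies:

```python
import pandas as pd

# Hypothetical slice of the dataset's 'file' column
emails = pd.DataFrame({'file': [
    'kaminski-v/sent/1.', 'kaminski-v/inbox/2.',
    'dasovich-j/sent/1.', 'allen-p/_sent_mail/1.',
]})

# The first path component names the user the mailbox belongs to
emails['user'] = emails['file'].str.split('/').str[0]
user_counts = emails['user'].value_counts()

# Share of the corpus held by the top senders
top_share = user_counts.head(2).sum() / len(emails)
```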

Subject/Message Word Count Analysis

# Pairwise relationships between per-user email counts and word counts
sns.pairplot(grouped_by_people.reset_index(), hue='user')

Although the code above produces the desired set of graphs, the output is accompanied by a slew of warning messages and is quite small and hard to read, so I have enlarged and cropped it below.

[Image: graph import.jpg, the enlarged and cropped pairplot]

That's better.

This series of graphs analyzes the number of emails sent by each person, along with their subject word count and message word count.
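A minimal sketch of how such a per-person summary might be built, assuming hypothetical user, Subject, and body columns; grouped_by_people here mirrors the name used in the pairplot cell above but its contents are illustrative:

```python
import pandas as pd

# Hypothetical tidied frame with user, subject, and body columns
emails = pd.DataFrame({
    'user': ['kaminski-v', 'kaminski-v', 'dasovich-j'],
    'Subject': ['Var model update', 'Re: risk', 'CA power crisis'],
    'body': ['Please see the attached var model.', 'Thanks.',
             'The hearings continue today.'],
})

# Word counts for subject and body of each email
emails['subject_wc'] = emails['Subject'].str.split().str.len()
emails['message_wc'] = emails['body'].str.split().str.len()

# One row per user: email count plus average word counts,
# the kind of frame the pairplot is drawn from
grouped_by_people = emails.groupby('user').agg(
    emails=('body', 'size'),
    subject_wc=('subject_wc', 'mean'),
    message_wc=('message_wc', 'mean'),
)
```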

Network Analysis

Now for some cool stuff... cool stuff that kills my computer.

This graph visualizes the email network, with each user who has sent or received emails shown as an individual node. Sadly, only the first 350 emails are displayed. Increasing this number further does two things: first, it clutters the graph to the point where no discernible information can be gleaned from it, and second, it takes a very long time to construct.

While trying out this graph at a greater volume, the Python task on my PC eventually consumed over 15 GB of RAM before I shut the operation down.

As is visible in the graph, the central node is Allen P, and all of the other nodes trace back to him. This would of course change if a different set of emails were examined, perhaps even leaving the graph with more than one central node.
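A graph of this kind can be sketched with networkx, treating each sender and recipient as a node; the edge list below is a fabricated miniature of the first few hundred emails:

```python
import networkx as nx

# Hypothetical sender -> recipient pairs standing in for the real emails
edges = [
    ('allen-p', 'belden-t'),
    ('allen-p', 'dunn-k'),
    ('belden-t', 'allen-p'),
    ('dunn-k', 'lay-k'),
]

G = nx.DiGraph()
G.add_edges_from(edges)

# The node with the highest degree is the graph's hub;
# nx.draw(G, with_labels=True) would render the picture itself
hub = max(G.nodes, key=lambda n: G.degree(n))
```

Because layout and drawing cost grows quickly with the number of edges, capping how many emails are added to the edge list keeps both the render time and the memory footprint manageable.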

Additional Network Analysis

This is similar to the previous graph in that it treats each user as a node and traces it back to a central point. The central point throughout the first 1,000 emails is once again Allen P. Again, this would not stay the same if the full dataset had been visualized; instead, the graph would turn into a giant, messy black ball.

Word Clouds

Finally, as more of a fun visualization than anything, I have constructed two word clouds depicting the most frequently used words in the subject and message body columns.
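The frequency counting underneath a word cloud can be sketched with the standard library; the message bodies and stopword list below are made up, and the wordcloud package's WordCloud class (an assumption about the rendering step) would draw the actual cloud from these counts:

```python
from collections import Counter
import re

# Fabricated message bodies standing in for the real column
bodies = [
    'Please review the attached gas contract.',
    'The power desk needs the gas forecast by Friday.',
    'Thanks, please send the energy schedule.',
]

# A tiny illustrative stopword list
STOPWORDS = {'the', 'by', 'a', 'an', 'to'}

# Lowercase, keep alphabetic tokens, drop stopwords, then count
words = []
for text in bodies:
    words += [w for w in re.findall(r'[a-z]+', text.lower())
              if w not in STOPWORDS]
freqs = Counter(words)

# WordCloud().generate_from_frequencies(freqs) would render the cloud itself
```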

Machine Learning

This output shows us four distinct clusters centered around four centroids, produced by the K-Means clustering algorithm. The clusters shown in yellow and green sit much farther from those in orange and blue on the graph, indicating a higher variance in their results.
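A minimal sketch of K-Means producing four clusters around four centroids, run on synthetic two-dimensional blobs rather than the notebook's real features; two blobs are deliberately placed far from the other two to echo the spread described above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Four synthetic blob centers; two are far from the other two
centers = np.array([[0, 0], [2, 2], [12, 12], [15, 15]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means with k=4; labels_ assigns each point to a centroid
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

Plotting X colored by labels, with km.cluster_centers_ overlaid, reproduces the kind of figure described above.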

The above output shows the most frequent topics retrieved by a topic-modeling algorithm. As would be expected from a company like Enron, words such as gas, power, and energy all show up here. We also see a large amount of standard email etiquette, such as please and thanks.

General Conclusions

As stated at the beginning of this notebook, I did not have a set-in-stone array of research questions that I wanted to answer, as many other people did with their projects. Through exploratory analysis and toying with the dataset, I was able to find questions and then answer them along the way.

Questions like:

- Who was the biggest sender and receiver of mail at the company?
- What topics did these emails generally cover?
- What language was used the most often across all emails?
- How did Enron catalogue these emails in their system?

I feel that throughout my work with this dataset I have sufficiently answered each of these questions to the best of my ability. If given more time to work with this dataset, there are a number of things I could imagine myself trying, including but not limited to:

- Answering who the largest person of interest is, based not on sent/received counts but on the language used. I feel this would be interesting given that this dataset was seized as part of a criminal investigation.
- Putting in more work on date/time analysis of the dataset. Although I tried this for a while, I could not get the modules fully working, and I would love to go back for another attempt if time allowed.
- Searching for members of the dataset who were eventually convicted or who appeared often in the media, and seeing how their language differed from the rest of the crowd.
- Running more accurate tests. Although I am satisfied with my results, some were produced with a small sample size, due to the sheer total size of the dataset.